Time Series Analysis


Freddie Zhang
Mar. 10, 2019


In [88]:
# Import packages
import os
import sys
import fbprophet
import datetime
import holidays
import numpy as np 
import pandas as pd
import seaborn as sns
import matplotlib.dates as mdates
import matplotlib.pyplot as pyplot
from datetime import date
from statistics import mode
from fbprophet import Prophet
from fbprophet.diagnostics import cross_validation
from fbprophet.diagnostics import performance_metrics
from matplotlib import pyplot as plt
from statsmodels.tsa import seasonal
from statsmodels.tsa import stattools
from search_sampler import SearchSampler
from pandas.plotting import autocorrelation_plot
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LassoLarsCV
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import r2_score
import warnings
warnings.filterwarnings('ignore')

Section 0. Brand Information


Brand: Netflix
Freebase ID: /m/017rf_
Source: https://www.wikidata.org


Section 1. Time Series Decomposition

In section 1, I

  1. converted the Netflix Google Trends data into a date-indexed format suitable for time series modeling.
  2. used statsmodels.tsa to decompose the time series.
  3. visualized the decomposed components: trend, seasonality, and irregularities (residuals).
In [90]:
# Read in Netflix_googletrends.csv as a DataFrame
Netflix = pd.read_csv('Netflix.googletrends.csv',parse_dates=['period'])
# Keep only the date part of the 'period' column.
Netflix['period'] = Netflix['period'].dt.date
In [91]:
# Set the column of 'period' as the index of DataFrame.
Netflix.set_index('period',inplace=True)
In [92]:
# Have a glance
Netflix.head(5)
Out[92]:
Unnamed: 0 date value onediffvalue
period
2013-10-07 0 Oct 07 2013 20979.550508 NaN
2013-10-08 1 Oct 08 2013 19372.317342 -1607.233166
2013-10-09 2 Oct 09 2013 19643.282124 270.964782
2013-10-10 3 Oct 10 2013 19588.818874 -54.463250
2013-10-11 4 Oct 11 2013 20986.168158 1397.349284
In [93]:
'''
Set up date locators and a formatter from the matplotlib.dates library so the time 
series can be plotted with readable year/month tick marks.
'''
years = mdates.YearLocator() # Locate years by matplotlib.dates
months = mdates.MonthLocator() # Locate months by matplotlib.dates
yearmonth = mdates.DateFormatter('%m,%y') # Format tick labels as mm,yy
In [94]:
# Plot of General Trends (Moving Averages)
fig, ax = pyplot.subplots(figsize=(40,8))

ax.xaxis.set_major_locator(months)
ax.xaxis.set_major_formatter(yearmonth)
ax.title.set_text('General Trends')
ax.plot(Netflix.index,Netflix['value'])

plt.show()
In [95]:
'''
Decomposing the time series by using 'seasonal.seasonal_decompose' from 'statsmodels.tsa' 
library.
'''
decompose_model = seasonal.seasonal_decompose(Netflix['value'].tolist(), freq=365, model='additive')
dir(decompose_model)
Out[95]:
['__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 'nobs',
 'observed',
 'plot',
 'resid',
 'seasonal',
 'trend']
In [96]:
'''
A standard additive time series can be expressed as $x_t = f_t + s_t + c_t + e_t$: the 
sum of the trend, seasonal, cyclical, and irregular components, in that order. 
Below, 'API Value', 'Trend', 'Seasonality', and 'Residual' are plotted versus time.
'''
fig, (ax1, ax2, ax3, ax4) = pyplot.subplots(4, figsize=(15,15))

ax1.plot(Netflix.index, Netflix['value'])
ax2.plot(Netflix.index, decompose_model.trend)
ax3.plot(Netflix.index, decompose_model.seasonal)
ax4.plot(Netflix.index, decompose_model.resid)

ax1.title.set_text('API Value')
ax2.title.set_text('Trend')
ax3.title.set_text('Seasonality')
ax4.title.set_text('Residual Plot')

plt.show()
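
As a quick sanity check on the additive form above (a minimal sketch reusing decompose_model), the components should sum back to the observed series wherever the trend, and therefore the residual, is defined.

In [ ]:
# For an additive decomposition, observed = trend + seasonal + resid wherever all
# components are defined (the trend and residual are NaN near the series edges).
observed = np.asarray(Netflix['value'].tolist())
recomposed = (np.asarray(decompose_model.trend)
              + np.asarray(decompose_model.seasonal)
              + np.asarray(decompose_model.resid))
mask = ~np.isnan(recomposed)
print(np.allclose(observed[mask], recomposed[mask])) # should print True for an additive model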

What is the overall trend? Is the search interest for your brand steadily growing?

According to the API Value and Trend plots above, the overall trend is downward from the beginning of 2014 through the end of 2017. Netflix search interest peaks in mid-2014 and mid-2015 and reaches its trough at the beginning of 2017. In general, search interest is not steadily growing; it trends downward.

For a closer look, we can restrict the timeline to a single year, which makes the within-year seasonal pattern easier to see.

What does the seasonality look like for this brand?

The seasonality plot shows that Netflix reaches its maximum search interest at the beginning and end of each year, with a smaller peak around the summer. This seasonality is consumer-driven: search interest for Netflix rises when consumers have more spare time.

In [97]:
# Add a new column of 'seasonal' component from decompose_model into the dataframe.
Netflix['seasonal'] = decompose_model.seasonal
# Have a glance
Netflix.head(5)
Out[97]:
Unnamed: 0 date value onediffvalue seasonal
period
2013-10-07 0 Oct 07 2013 20979.550508 NaN -2179.300695
2013-10-08 1 Oct 08 2013 19372.317342 -1607.233166 -1482.657901
2013-10-09 2 Oct 09 2013 19643.282124 270.964782 466.350198
2013-10-10 3 Oct 10 2013 19588.818874 -54.463250 1031.644542
2013-10-11 4 Oct 11 2013 20986.168158 1397.349284 1388.189943
In [98]:
# Analyze the seasonal pattern over a shorter, one-year range.
startdate = pd.to_datetime("2015-01-01").date()
enddate = pd.to_datetime("2015-12-31").date()
oneyear = Netflix.loc[startdate:enddate]
In [99]:
# Plot of Seasonality in 2015
fig, ax = pyplot.subplots(figsize=(15,8))

pyplot.plot(oneyear.index, oneyear.seasonal)
ax.title.set_text('Seasonality in One-year Range')

plt.show()

The plot above shows the seasonal component over a one-year range. At this resolution it is clearer that Google search interest peaks at the beginning and end of the year. Search interest declines from January to May, recovers somewhat during the summer, is unstable from September to November, and starts climbing again in December.

Section 2. Facebook Prophet for Forecasting


In section 2, I

  1. fitted a model using Facebook's Prophet forecasting tool and graphed a residual plot to test its accuracy.
  2. created a forecast of next year's Google search interest with the same tool and created a graph of the forecast.
In [101]:
Netflix = pd.read_csv('Netflix.googletrends.csv',parse_dates=['period'])
prophetdataframe = pd.DataFrame() # Create a new DataFrame in Prophet's expected format
prophetdataframe['ds'] = Netflix['period'] # Datestamp
prophetdataframe['y'] = Netflix['value'] # Value
In [102]:
# Have a glance
prophetdataframe.head(5)
Out[102]:
ds y
0 2013-10-07 20979.550508
1 2013-10-08 19372.317342
2 2013-10-09 19643.282124
3 2013-10-10 19588.818874
4 2013-10-11 20986.168158
In [103]:
# Fit the model by fbprophet. 
m = Prophet()
m.fit(prophetdataframe)
INFO:fbprophet:Disabling daily seasonality. Run prophet with daily_seasonality=True to override this.
Out[103]:
<fbprophet.forecaster.Prophet at 0x1c31b45048>
In [104]:
# Create a new dataframe for next one year with dates
future = m.make_future_dataframe(periods = 365) 
# Have a glance
future.tail()
Out[104]:
ds
2181 2019-09-27
2182 2019-09-28
2183 2019-09-29
2184 2019-09-30
2185 2019-10-01
In [105]:
'''
The prediction from fbprophet returns a new DataFrame that includes columns for the trend, 
yhat, additive terms, weekly and yearly seasonality, multiplicative terms (empty here), 
and their uncertainty ranges.
'''
forecast = m.predict(future)
forecast.tail()
Out[105]:
ds trend yhat_lower yhat_upper trend_lower trend_upper additive_terms additive_terms_lower additive_terms_upper weekly weekly_lower weekly_upper yearly yearly_lower yearly_upper multiplicative_terms multiplicative_terms_lower multiplicative_terms_upper yhat
2181 2019-09-27 20369.812455 15281.410259 21145.725617 18735.467006 22241.341869 -2305.527187 -2305.527187 -2305.527187 -1162.812517 -1162.812517 -1162.812517 -1142.714670 -1142.714670 -1142.714670 0.0 0.0 0.0 18064.285268
2182 2019-09-28 20372.361037 20346.334021 26219.796783 18730.608963 22254.719438 2793.341771 2793.341771 2793.341771 3854.264822 3854.264822 3854.264822 -1060.923052 -1060.923052 -1060.923052 0.0 0.0 0.0 23165.702808
2183 2019-09-29 20374.909620 21572.735124 27440.054563 18725.346779 22267.311382 4013.627179 4013.627179 4013.627179 4980.858450 4980.858450 4980.858450 -967.231271 -967.231271 -967.231271 0.0 0.0 0.0 24388.536800
2184 2019-09-30 20377.458203 15728.690127 21758.448491 18719.614868 22278.190239 -1717.283273 -1717.283273 -1717.283273 -853.269562 -853.269562 -853.269562 -864.013712 -864.013712 -864.013712 0.0 0.0 0.0 18660.174930
2185 2019-10-01 20380.006786 14524.255804 20542.918715 18713.882957 22287.666294 -2927.161767 -2927.161767 -2927.161767 -2173.206551 -2173.206551 -2173.206551 -753.955217 -753.955217 -753.955217 0.0 0.0 0.0 17452.845018
In [106]:
# Model Diagnostics
erroranalytics = m.predict(prophetdataframe)
erroranalytics['value'] = prophetdataframe['y'] # attach the actual values
erroranalytics['residuals'] = erroranalytics['value'] - erroranalytics['yhat'] # residual = actual - predicted
erroranalytics['absoluteresiduals'] = abs(erroranalytics['residuals']) # absolute error
# Have a glance
erroranalytics.head()
Out[106]:
ds trend yhat_lower yhat_upper trend_lower trend_upper additive_terms additive_terms_lower additive_terms_upper weekly ... yearly yearly_lower yearly_upper multiplicative_terms multiplicative_terms_lower multiplicative_terms_upper yhat value residuals absoluteresiduals
0 2013-10-07 24253.506485 21039.075364 25672.945075 24253.506485 24253.506485 -935.867106 -935.867106 -935.867106 -853.269562 ... -82.597544 -82.597544 -82.597544 0.0 0.0 0.0 23317.639379 20979.550508 -2338.088871 2338.088871
1 2013-10-08 24243.191682 19864.427188 24467.541027 24243.191682 24243.191682 -2191.201605 -2191.201605 -2191.201605 -2173.206551 ... -17.995054 -17.995054 -17.995054 0.0 0.0 0.0 22051.990078 19372.317342 -2679.672736 2679.672736
2 2013-10-09 24232.876879 19661.510384 24186.239106 24232.876879 24232.876879 -2304.887576 -2304.887576 -2304.887576 -2334.001433 ... 29.113856 29.113856 29.113856 0.0 0.0 0.0 21927.989303 19643.282124 -2284.707179 2284.707179
3 2013-10-10 24222.562077 19624.441298 24288.204168 24222.562077 24222.562077 -2254.781087 -2254.781087 -2254.781087 -2311.833210 ... 57.052123 57.052123 57.052123 0.0 0.0 0.0 21967.780990 19588.818874 -2378.962116 2378.962116
4 2013-10-11 24212.247274 20864.271577 25378.721595 24212.247274 24212.247274 -1098.177716 -1098.177716 -1098.177716 -1162.812517 ... 64.634801 64.634801 64.634801 0.0 0.0 0.0 23114.069557 20986.168158 -2127.901400 2127.901400

5 rows × 22 columns

In [107]:
# Visualization
fig1 = m.plot(forecast)
In [108]:
# component visualizations
fig2 = m.plot_components(forecast)
In [109]:
# sum of absolute residuals
totalerror = erroranalytics['absoluteresiduals'].sum()
In [110]:
print(totalerror)
2363300.1076838253
In [111]:
# MAE
print(totalerror/len(erroranalytics))
# Meaning: the model was off about 1300 Google trend units every day on average.
1297.8034638571253
In [112]:
erroranalytics['value'].describe()
Out[112]:
count     1821.000000
mean     20097.461182
std       4249.942089
min       9673.065405
25%      16888.614395
50%      19093.643513
75%      22227.758983
max      38175.606745
Name: value, dtype: float64

Section 2.1 Create a forecast using holiday effects

In [113]:
goodholidays = [] # new list for holidays
for date, name in sorted(holidays.US(years=2014).items()): # take an example of 2014
    goodholidays.append(name)
    print(date, name)
2014-01-01 New Year's Day
2014-01-20 Martin Luther King, Jr. Day
2014-02-17 Washington's Birthday
2014-05-26 Memorial Day
2014-07-04 Independence Day
2014-09-01 Labor Day
2014-10-13 Columbus Day
2014-11-11 Veterans Day
2014-11-27 Thanksgiving
2014-12-25 Christmas Day
In [114]:
# new DataFrame
sales = pd.DataFrame(columns=['holiday','ds','lower_window','upper_window'])
In [115]:
for year in range(2014,2021): # major holidays from 2014 to 2020
    for date, name in sorted(holidays.US(years=year).items()):
        lower_window = 0 # reset the 'lower_window'
        upper_window = 0 # reset the 'upper_window'
        if name in goodholidays:
            dayofweek = date.weekday()
        #add additional dates to current holidays
            if name == "New Year's Day":
                lower_window = -1
            if name == "Thanksgiving":
                upper_window = 3
            if name == "Christmas Day":
                lower_window = -5
                upper_window = 6
            if name == "Labor Day":
                lower_window = -3
            if name == "Independence Day":
                upper_window = 3
            sales.loc[len(sales)] = [name, date, lower_window, upper_window]
In [116]:
'''
Include the Super Bowl as a custom holiday: a major live viewing event that plausibly 
affects Netflix search interest.
'''
superbowls = pd.DataFrame({
  'holiday': 'Superbowl',
  'ds': pd.to_datetime(['2014-02-02', '2015-02-01', '2016-02-07',
                       '2017-02-05', '2018-02-04', '2019-02-03',
                       '2020-02-02']),
  'lower_window': 0,
  'upper_window': 1,
})
In [117]:
superbowls['ds'] = superbowls['ds'].dt.date
In [118]:
sales = pd.concat([sales, superbowls],ignore_index=True)
In [119]:
sales = sales.sort_values(['ds'], ascending=[True]).reset_index(drop=True)
# Have a glance
sales.head()
Out[119]:
holiday ds lower_window upper_window
0 New Year's Day 2014-01-01 -1 0
1 Martin Luther King, Jr. Day 2014-01-20 0 0
2 Superbowl 2014-02-02 0 1
3 Washington's Birthday 2014-02-17 0 0
4 Memorial Day 2014-05-26 0 0
In [120]:
adv_prophetdataframe = pd.DataFrame()
adv_prophetdataframe['ds'] = Netflix['period']
adv_prophetdataframe['y'] = Netflix['value']
adv_prophetdataframe.head()
Out[120]:
ds y
0 2013-10-07 20979.550508
1 2013-10-08 19372.317342
2 2013-10-09 19643.282124
3 2013-10-10 19588.818874
4 2013-10-11 20986.168158
In [121]:
adv_m = Prophet(holidays=sales)
adv_m.add_country_holidays(country_name='US')
adv_m.fit(adv_prophetdataframe)
# the built-in US holiday calendar is layered on top of the custom holiday list above
INFO:fbprophet:Disabling daily seasonality. Run prophet with daily_seasonality=True to override this.
Out[121]:
<fbprophet.forecaster.Prophet at 0x1c3035b898>
In [122]:
print(adv_m.train_holiday_names)
0                  New Year's Day
1     Martin Luther King, Jr. Day
2                       Superbowl
3           Washington's Birthday
4                    Memorial Day
5                Independence Day
6                       Labor Day
7                    Columbus Day
8                    Veterans Day
9                    Thanksgiving
10                  Christmas Day
11       Christmas Day (Observed)
12      New Year's Day (Observed)
13        Veterans Day (Observed)
14    Independence Day (Observed)
dtype: object
In [123]:
'''
Create a new DataFrame for predictions over the next year.
'''
adv_future = adv_m.make_future_dataframe(periods = 365)
adv_forecast = adv_m.predict(adv_future)
In [124]:
adv_erroranalytics = adv_m.predict(adv_prophetdataframe)
adv_erroranalytics['value'] = adv_prophetdataframe['y']
adv_erroranalytics['residuals'] = adv_erroranalytics['value'] - adv_erroranalytics['yhat']
adv_erroranalytics['absoluteresiduals'] = abs(adv_erroranalytics['residuals'])
adv_erroranalytics.head()
Out[124]:
ds trend yhat_lower yhat_upper trend_lower trend_upper Christmas Day Christmas Day_lower Christmas Day_upper Christmas Day (Observed) ... yearly yearly_lower yearly_upper multiplicative_terms multiplicative_terms_lower multiplicative_terms_upper yhat value residuals absoluteresiduals
0 2013-10-07 24236.365093 20907.321575 25062.675489 24236.365093 24236.365093 0.0 0.0 0.0 0.0 ... 28.781891 28.781891 28.781891 0.0 0.0 0.0 23111.483470 20979.550508 -2131.932961 2131.932961
1 2013-10-08 24225.796741 20278.063695 24154.509700 24225.796741 24225.796741 0.0 0.0 0.0 0.0 ... 56.151747 56.151747 56.151747 0.0 0.0 0.0 22202.049520 19372.317342 -2829.732178 2829.732178
2 2013-10-09 24215.228388 20062.796087 23876.912388 24215.228388 24215.228388 0.0 0.0 0.0 0.0 ... 68.460592 68.460592 68.460592 0.0 0.0 0.0 22040.139980 19643.282124 -2396.857856 2396.857856
3 2013-10-10 24204.660036 19894.285164 24003.722991 24204.660036 24204.660036 0.0 0.0 0.0 0.0 ... 65.206428 65.206428 65.206428 0.0 0.0 0.0 21950.387871 19588.818874 -2361.568997 2361.568997
4 2013-10-11 24194.091683 21008.293163 25021.448389 24194.091683 24194.091683 0.0 0.0 0.0 0.0 ... 46.269666 46.269666 46.269666 0.0 0.0 0.0 23039.317894 20986.168158 -2053.149737 2053.149737

5 rows × 70 columns

In [125]:
adv_totalerror = adv_erroranalytics['absoluteresiduals'].sum()
In [126]:
print(adv_totalerror)
2053497.7449622154
In [127]:
print(adv_totalerror/len(adv_erroranalytics))
1127.675862142897

How accurate does the model appear to be?

The accuracy of the model is represented here by the mean absolute prediction error; a lower value indicates higher accuracy. In this case, the forecast that includes holiday effects has a lower mean absolute error (1127.7 versus 1297.8) and is therefore the more accurate of the two models.
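
As an additional accuracy check, the cross_validation and performance_metrics helpers imported at the top of the notebook can estimate out-of-sample error with a rolling forecast origin. The sketch below applies them to the baseline model m; the initial, period, and horizon window sizes are illustrative assumptions.

In [ ]:
# Rolling-origin evaluation of the baseline Prophet model (illustrative window sizes).
df_cv = cross_validation(m, initial='730 days', period='90 days', horizon='90 days')
df_p = performance_metrics(df_cv)
print(df_p[['horizon', 'mae', 'rmse']].head())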

In [128]:
fig, ax = pyplot.subplots(figsize=(15,8))

ax.plot(adv_erroranalytics['ds'].dt.to_pydatetime(), adv_erroranalytics['absoluteresiduals'], 'k.', color='#0072B2')
ax.title.set_text('Residual Plot')

plt.show()

Are the residuals consistent across time? If not, why do you think it varies?

Based on the residual plot above, the residuals are not consistent across time. They tend to be larger at the beginning and end of each year and are more uniform otherwise. I attribute this variation to holiday effects and outliers: even though holiday effects are included in the forecast, the spikes in search interest around those periods are still too large to be fully absorbed by the model.
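
As a quick numeric check of this claim, the mean absolute residual can be grouped by calendar month (a sketch reusing the adv_erroranalytics frame built above); if the claim holds, January and December should show the largest values.

In [ ]:
# Mean absolute residual by calendar month over the fitted period.
monthly_resid = adv_erroranalytics.groupby(adv_erroranalytics['ds'].dt.month)['absoluteresiduals'].mean()
print(monthly_resid.round(1))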

Section 3. Forecast Next Year's Google Search Interest

In section 3, I

  1. used the Prophet tool to forecast next year's Google search interest.
  2. created a graph of the forecast.
In [129]:
# create forecast plot
fig1 = m.plot(forecast)
In [130]:
# create component plot
fig2 = m.plot_components(forecast)

What are the projected high interest times?

Based on the weekly component, Saturday and Sunday are when Google search interest for Netflix is highest. The yearly component indicates that January is also a high-interest period. These projected high-interest times suggest consumers have more free time on weekends and around the New Year to watch Netflix. Since many people have a summer break, summer also shows relatively elevated search activity.

What are the projected low interest times?

From the same components, search interest drops sharply on Monday and remains low from Tuesday to Thursday, then rises again on Friday and stays high through the weekend. On an annual basis, May tends to have the lowest search interest; interest declines from January to February and fluctuates from September to November. These patterns suggest consumers are busy during the weekdays and have less time to watch Netflix, and that people are also busy in the run-up to major holidays.

Section 4. Advertising Expenditure Data & Granger Causality

In section 4, I

  1. loaded the Netflix ad spend data and used the expenditure time series as possible predictors.
  2. analyzed whether advertising expenditures drive Google search interest.

Section 4.1 Preprocessing AdSpend Data

In [131]:
# load adspend data
adspend = pd.read_csv('netflix.adspend.csv', encoding="ISO-8859-1")
In [132]:
# perform data cleaning
adspend['TIME PERIOD'] = adspend['TIME PERIOD'].str.replace("WEEK OF ","",regex=True)
adspend['TIME PERIOD'] = adspend['TIME PERIOD'].str.replace(" \(B\)", "",regex=True)
adspend.columns = adspend.columns.str.replace(" DOLS \(000\)", "",regex=True)
In [133]:
# Ignore the last 5 rows to avoid NA.
adspend = adspend[:-5]
In [134]:
adspend.tail()
Out[134]:
TIME PERIOD PRODUCT TOTAL NETWORK TV CABLE TV SYNDICATION SPOT TV MAGAZINES SUNDAY MAGS NATL NEWSP NEWSPAPER NETWORK RADIO NAT SPOT RADIO OUTDOOR
1759 OCT 01, 2018 Netflix-Maniac : Online 31.2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 31.2
1760 OCT 01, 2018 Netflix-Private Life : Online 421.2 0.0 0.0 0.0 16.1 93.5 0.0 293.6 18.1 0.0 0.0 0.0
1761 OCT 01, 2018 Netflix-Quincy : Online 172.9 0.0 0.0 0.0 0.0 172.9 0.0 0.0 0.0 0.0 0.0 0.0
1762 OCT 01, 2018 Netflix-Super Monsters : Online 2.3 0.0 2.3 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1763 OCT 01, 2018 Private Life : Movie 27.2 0.0 0.0 0.0 27.2 0.0 0.0 0.0 0.0 0.0 0.0 0.0
In [135]:
# display all the columns names
list(adspend)
Out[135]:
['TIME PERIOD',
 'PRODUCT',
 'TOTAL',
 'NETWORK TV',
 'CABLE TV',
 'SYNDICATION',
 'SPOT TV',
 'MAGAZINES',
 'SUNDAY MAGS',
 'NATL NEWSP',
 'NEWSPAPER',
 'NETWORK RADIO',
 'NAT SPOT RADIO',
 'OUTDOOR']
In [136]:
# convert time period to datetime format.
adspend['FIXED TIME'] = pd.to_datetime(adspend['TIME PERIOD'])
In [137]:
# aggregate total ad spend by time period
adspendnoproducts = adspend.pivot_table(index = 'FIXED TIME', aggfunc = np.sum)
In [138]:
adspendnoproducts.head()
Out[138]:
CABLE TV MAGAZINES NAT SPOT RADIO NATL NEWSP NETWORK RADIO NETWORK TV NEWSPAPER OUTDOOR SPOT TV SUNDAY MAGS SYNDICATION TOTAL
FIXED TIME
2013-10-07 54.6 0.0 0.0 0.0 0.0 222.9 0.0 0.0 19.8 0.0 263.2 560.5
2013-10-14 21.3 0.0 0.0 0.0 0.0 117.2 0.0 0.0 9.7 0.0 413.5 561.7
2013-10-21 14.1 0.0 0.0 0.0 0.0 145.8 0.0 0.0 10.3 0.0 279.1 449.3
2013-10-28 3.7 0.0 0.0 0.0 0.0 44.7 0.0 233.0 21.5 0.0 326.1 629.1
2013-11-04 13.0 0.0 0.0 0.0 0.0 175.9 0.0 0.0 15.1 0.0 215.3 419.3
In [139]:
list(adspendnoproducts)
Out[139]:
['CABLE TV',
 'MAGAZINES',
 'NAT SPOT RADIO',
 'NATL NEWSP',
 'NETWORK RADIO',
 'NETWORK TV',
 'NEWSPAPER',
 'OUTDOOR',
 'SPOT TV',
 'SUNDAY MAGS',
 'SYNDICATION',
 'TOTAL']
In [140]:
adspendnoproducts.index
Out[140]:
DatetimeIndex(['2013-10-07', '2013-10-14', '2013-10-21', '2013-10-28',
               '2013-11-04', '2013-11-11', '2013-11-18', '2013-11-25',
               '2013-12-02', '2013-12-09',
               ...
               '2018-07-30', '2018-08-06', '2018-08-13', '2018-08-20',
               '2018-08-27', '2018-09-03', '2018-09-10', '2018-09-17',
               '2018-09-24', '2018-10-01'],
              dtype='datetime64[ns]', name='FIXED TIME', length=261, freq=None)
In [141]:
# plot total ads spend 
fig, (ax1, ax2, ax3, ax4) = plt.subplots(4, figsize =(15,15), sharex = 'all')

ax1.plot(adspendnoproducts.index, adspendnoproducts['CABLE TV']) 
ax2.plot(adspendnoproducts.index, adspendnoproducts['NETWORK TV']) 
ax3.plot(adspendnoproducts.index, adspendnoproducts['SPOT TV'])  
ax4.plot(adspendnoproducts.index, adspendnoproducts['TOTAL'])

ax1.title.set_text('CABLE TV')
ax2.title.set_text('NETWORK TV')
ax3.title.set_text('SPOT TV')
ax4.title.set_text('TOTAL')

plt.show()
In [142]:
# set period as index
Netflix.set_index('period', inplace=True)
In [143]:
# resample to weekly frequency (weeks starting Monday) and calculate the average
Netflixweekly = Netflix.resample('W-MON', closed='left', label='left').mean()
In [144]:
Netflixweekly.head()
Out[144]:
Unnamed: 0 value onediffvalue
period
2013-10-07 3 23072.559082 1311.731696
2013-10-14 10 22692.250310 -168.587560
2013-10-21 17 22073.755462 -106.178985
2013-10-28 24 21326.559388 259.806443
2013-11-04 31 21788.654927 100.181845
In [145]:
# join weekly Netflix ad spends with aggregated Google trends data.
Netflixweeklydata = adspendnoproducts.join(Netflixweekly)
In [146]:
Netflixweeklydata.head()
Out[146]:
CABLE TV MAGAZINES NAT SPOT RADIO NATL NEWSP NETWORK RADIO NETWORK TV NEWSPAPER OUTDOOR SPOT TV SUNDAY MAGS SYNDICATION TOTAL Unnamed: 0 value onediffvalue
FIXED TIME
2013-10-07 54.6 0.0 0.0 0.0 0.0 222.9 0.0 0.0 19.8 0.0 263.2 560.5 3 23072.559082 1311.731696
2013-10-14 21.3 0.0 0.0 0.0 0.0 117.2 0.0 0.0 9.7 0.0 413.5 561.7 10 22692.250310 -168.587560
2013-10-21 14.1 0.0 0.0 0.0 0.0 145.8 0.0 0.0 10.3 0.0 279.1 449.3 17 22073.755462 -106.178985
2013-10-28 3.7 0.0 0.0 0.0 0.0 44.7 0.0 233.0 21.5 0.0 326.1 629.1 24 21326.559388 259.806443
2013-11-04 13.0 0.0 0.0 0.0 0.0 175.9 0.0 0.0 15.1 0.0 215.3 419.3 31 21788.654927 100.181845
In [147]:
# make a visual comparison of the weekly average Netflix ad spend data.
dimensions = (10, 10)
fig, (ax1, ax2) = plt.subplots(2, figsize = dimensions, sharex = 'all')

ax1.plot(Netflixweeklydata.index, Netflixweeklydata['TOTAL']) 
ax2.plot(Netflixweeklydata.index, Netflixweeklydata['NETWORK TV']) 

ax1.title.set_text('Total')
ax2.title.set_text('NETWORK TV')

plt.show()

Run a correlation test between total Netflix ad spend and the weekly average Google Trends data. According to the result, the two series are only weakly correlated. To get a more meaningful correlation, we make the data stationary by differencing.

In [148]:
# perform correlation test
Netflixweeklydata['TOTAL'].corr(Netflixweeklydata['value'])
# According to the result, the two series are only weakly correlated.
# We need to convert the data to a stationary format for time series analysis.
Out[148]:
0.07931662612588285

Calculate week-over-week (first) differences of TOTAL and store them in a column called TOTALdiff. Do the same for value and store the result in a column called valuediff.

In [149]:
# calculate weekly first differences for Google search interest and ad spend.
Netflixweeklydata['TOTALdiff'] = Netflixweeklydata['TOTAL'].diff(1)
Netflixweeklydata['valuediff'] = Netflixweeklydata['value'].diff(1)
In [150]:
# calculate the correlation
Netflixweeklydata['TOTALdiff'].corr(Netflixweeklydata['valuediff'])
Out[150]:
0.18058165833278841
In [151]:
Netflixweeklydata.to_csv('adspend.googletrends.weekly.csv')

Section 4.2 Granger Causality and Shifting

In [152]:
# read in data
adspend_trends = pd.read_csv('adspend.googletrends.weekly.csv', parse_dates = ['FIXED TIME'])
In [153]:
# set fixed time as index
adspend_trends.set_index('FIXED TIME', inplace = True)
In [154]:
list(adspend_trends)
Out[154]:
['CABLE TV',
 'MAGAZINES',
 'NAT SPOT RADIO',
 'NATL NEWSP',
 'NETWORK RADIO',
 'NETWORK TV',
 'NEWSPAPER',
 'OUTDOOR',
 'SPOT TV',
 'SUNDAY MAGS',
 'SYNDICATION',
 'TOTAL',
 'Unnamed: 0',
 'value',
 'onediffvalue',
 'TOTALdiff',
 'valuediff']
In [155]:
# drop the helper columns created earlier and inspect the column names again
adspend_trends = adspend_trends.drop(columns = ['onediffvalue', 'Unnamed: 0', 'TOTALdiff', 'valuediff'])
list(adspend_trends)
Out[155]:
['CABLE TV',
 'MAGAZINES',
 'NAT SPOT RADIO',
 'NATL NEWSP',
 'NETWORK RADIO',
 'NETWORK TV',
 'NEWSPAPER',
 'OUTDOOR',
 'SPOT TV',
 'SUNDAY MAGS',
 'SYNDICATION',
 'TOTAL',
 'value']
In [156]:
# use the ADF test to check each series for stationarity (p-values printed below)
for acolumn in list(adspend_trends):
    adf_result = stattools.adfuller(adspend_trends[acolumn])
    print(adf_result[1])
4.944569609526588e-09
0.06237188986361679
0.6888967165130822
0.5507690801310768
0.9967319472189904
5.200387873510148e-06
0.09667874305370416
0.7976617760666814
0.34853519025821755
4.858038849581863e-29
0.2744078405823607
0.24171083022075313
0.05407759564271027
In [157]:
adspend_trends_diff = pd.DataFrame()
In [158]:
# calculate weekly first differences to see if we can get a stationary dataset
for acolumn in list(adspend_trends):
    columnnames = "%s_diff" % (acolumn)
    adspend_trends_diff[columnnames] = adspend_trends[acolumn].diff(1)
adspend_trends_diff.head()
Out[158]:
CABLE TV_diff MAGAZINES_diff NAT SPOT RADIO_diff NATL NEWSP_diff NETWORK RADIO_diff NETWORK TV_diff NEWSPAPER_diff OUTDOOR_diff SPOT TV_diff SUNDAY MAGS_diff SYNDICATION_diff TOTAL_diff value_diff
FIXED TIME
2013-10-07 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2013-10-14 -33.3 0.0 0.0 0.0 0.0 -105.7 0.0 0.0 -10.1 0.0 150.3 1.2 -380.308772
2013-10-21 -7.2 0.0 0.0 0.0 0.0 28.6 0.0 0.0 0.6 0.0 -134.4 -112.4 -618.494848
2013-10-28 -10.4 0.0 0.0 0.0 0.0 -101.1 0.0 233.0 11.2 0.0 47.0 179.8 -747.196074
2013-11-04 9.3 0.0 0.0 0.0 0.0 131.2 0.0 -233.0 -6.4 0.0 -110.8 -209.8 462.095539
In [159]:
for acolumn in list(adspend_trends_diff):
    adf_result = stattools.adfuller(adspend_trends_diff[acolumn].iloc[1:])
    print(adf_result[1])
2.737565394693417e-16
2.834615515519561e-14
8.44449296505387e-15
5.2624941219073705e-12
6.799586104486564e-06
4.097629531847151e-13
2.9107116033648535e-12
1.3630246836291944e-15
3.5528864600703405e-20
1.507821040229618e-13
1.8385816673648825e-17
7.158438380058312e-05
1.904983788426432e-13
In [160]:
list(adspend_trends_diff)
Out[160]:
['CABLE TV_diff',
 'MAGAZINES_diff',
 'NAT SPOT RADIO_diff',
 'NATL NEWSP_diff',
 'NETWORK RADIO_diff',
 'NETWORK TV_diff',
 'NEWSPAPER_diff',
 'OUTDOOR_diff',
 'SPOT TV_diff',
 'SUNDAY MAGS_diff',
 'SYNDICATION_diff',
 'TOTAL_diff',
 'value_diff']
In [161]:
# calculate correlations between the differenced ad spend variables and Google search interest
correlationlist = []
for acolumn in list(adspend_trends_diff):
    if "value" not in acolumn:
        corr = adspend_trends_diff[acolumn].corr(adspend_trends_diff['value_diff'])
        print(acolumn, corr)
        correlationlist.append(corr)
CABLE TV_diff 0.2203641011484586
MAGAZINES_diff 0.05606900447146863
NAT SPOT RADIO_diff 0.02480378803853413
NATL NEWSP_diff 0.004636281702527561
NETWORK RADIO_diff -0.014175420952063408
NETWORK TV_diff 0.09689017500866812
NEWSPAPER_diff 0.14791680031854618
OUTDOOR_diff 0.09041040738031347
SPOT TV_diff 0.011802221229822218
SUNDAY MAGS_diff 0.04533792349282706
SYNDICATION_diff 0.12708521821198435
TOTAL_diff 0.1805816583327881
In [162]:
# calculate the mean of these correlations.
np.mean(correlationlist)
Out[162]:
0.08264351319865625
In [163]:
# calculate the correlation again by using the original data.
correlationlist = []
for acolumn in list(adspend_trends):
    if "value" not in acolumn:
        corr = adspend_trends[acolumn].corr(adspend_trends['value'])
        print(acolumn, corr)
        correlationlist.append(corr)
CABLE TV 0.21732364780401534
MAGAZINES 0.04282481664567579
NAT SPOT RADIO 0.003628544658318102
NATL NEWSP -0.07741452486981847
NETWORK RADIO -0.03582219339670275
NETWORK TV 0.04137759675561025
NEWSPAPER 0.035156521348975484
OUTDOOR -0.0883332173235734
SPOT TV -0.1230979007347311
SUNDAY MAGS -0.017685645562027594
SYNDICATION 0.4945035319256308
TOTAL 0.0793166261258828
In [164]:
np.mean(correlationlist)
Out[164]:
0.047648150281437936
In [165]:
# generate heatmap for correlation
f, ax = plt.subplots(figsize = (10, 8))
corr = adspend_trends_diff.corr()
sns.heatmap(corr, mask = np.zeros_like(corr, dtype = np.bool), 
            cmap = sns.diverging_palette(220, 10, as_cmap = True),
            square = True, ax = ax)
plt.show()
In [166]:
# perform granger causality test
# set the measurement period
numofweeks = 12
significantlags = []
for acolumn in list(adspend_trends_diff):
    if 'value' not in acolumn:
        testframe = adspend_trends_diff[['value_diff', acolumn]]
        testframe = testframe.iloc[1:]
        results = stattools.grangercausalitytests(testframe, numofweeks, verbose = False)
        for week in range(1, numofweeks + 1):
            # set level of significance.
            if results[week][0]['params_ftest'][1] < 0.05:
                print('%s is significant at %s weeks' % (acolumn, week))
                significantlags.append(week)
CABLE TV_diff is significant at 1 weeks
NAT SPOT RADIO_diff is significant at 1 weeks
NAT SPOT RADIO_diff is significant at 3 weeks
NAT SPOT RADIO_diff is significant at 4 weeks
NAT SPOT RADIO_diff is significant at 5 weeks
NAT SPOT RADIO_diff is significant at 6 weeks
NAT SPOT RADIO_diff is significant at 7 weeks
NAT SPOT RADIO_diff is significant at 8 weeks
NAT SPOT RADIO_diff is significant at 9 weeks
NAT SPOT RADIO_diff is significant at 10 weeks
NAT SPOT RADIO_diff is significant at 11 weeks
NAT SPOT RADIO_diff is significant at 12 weeks
NETWORK RADIO_diff is significant at 3 weeks
NETWORK RADIO_diff is significant at 4 weeks
NETWORK RADIO_diff is significant at 5 weeks
NETWORK RADIO_diff is significant at 6 weeks
NETWORK RADIO_diff is significant at 7 weeks
NETWORK RADIO_diff is significant at 8 weeks
NETWORK RADIO_diff is significant at 10 weeks
OUTDOOR_diff is significant at 2 weeks
OUTDOOR_diff is significant at 3 weeks
OUTDOOR_diff is significant at 5 weeks
SUNDAY MAGS_diff is significant at 2 weeks
SUNDAY MAGS_diff is significant at 3 weeks
SUNDAY MAGS_diff is significant at 4 weeks
SUNDAY MAGS_diff is significant at 5 weeks
SUNDAY MAGS_diff is significant at 6 weeks
SUNDAY MAGS_diff is significant at 7 weeks
SUNDAY MAGS_diff is significant at 8 weeks
SUNDAY MAGS_diff is significant at 9 weeks
SUNDAY MAGS_diff is significant at 10 weeks
SUNDAY MAGS_diff is significant at 11 weeks
SUNDAY MAGS_diff is significant at 12 weeks
SYNDICATION_diff is significant at 4 weeks
SYNDICATION_diff is significant at 5 weeks
SYNDICATION_diff is significant at 11 weeks
SYNDICATION_diff is significant at 12 weeks
TOTAL_diff is significant at 5 weeks
TOTAL_diff is significant at 6 weeks

Do any advertising expenditures drive Google search interest?

Yes, I think CABLE TV and NEWSPAPER could drive Google search interest. Moreover, NAT SPOT RADIO, NETWORK RADIO, and SYNDICATION also have longer-term effects on search interest. According to the correlation test between the differenced ad spend and the differenced Google search interest, CABLE TV and NEWSPAPER have the highest positive correlations with search interest, which suggests that increasing spend in these two categories might also increase Google search interest. CABLE TV and NEWSPAPER are therefore effective channels for advertising Netflix. As cable TV's most direct competitor, Netflix could highlight its advantages through a direct comparison on TV.

The Granger causality test does not measure correlation directly; instead, it tests whether lagged values of an ad spend series help predict search interest, which suggests that advertising expenditures can still drive search interest N weeks after the ads run.
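
For readability, the same test results can be condensed to the smallest p-value per ad spend channel and the lag at which it occurs. The sketch below reuses the variables defined above.

In [ ]:
# For each ad spend channel, report the lag (1-12 weeks) with the smallest p-value
# from the Granger causality tests run above.
for acolumn in list(adspend_trends_diff):
    if 'value' not in acolumn:
        testframe = adspend_trends_diff[['value_diff', acolumn]].iloc[1:]
        results = stattools.grangercausalitytests(testframe, numofweeks, verbose=False)
        pvals = {week: results[week][0]['params_ftest'][1] for week in range(1, numofweeks + 1)}
        best = min(pvals, key=pvals.get)
        print('%s: smallest p-value %.4f at lag %d' % (acolumn, pvals[best], best))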

In [167]:
fig, ax = pyplot.subplots(figsize=(15,8))

ax.plot(adspend_trends_diff.index, adspend_trends_diff['TOTAL_diff'])
ax.plot(adspend_trends_diff.index, adspend_trends_diff['value_diff'])

ax.title.set_text('Time Series Visualizations of Ad Spend and Google Search Interest')
ax.legend(('Ad Spend','Google Search Interest'))

plt.show()

Create a time series visualization that overlays both ad spend and Google search interest to see whether they visually align.

According to the plot above, the differenced Google search interest series does visually align with the differenced ad spend series.
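
To quantify the overlay, the correlation between the differenced search interest and the differenced total ad spend can be computed at several weekly shifts (a sketch; the lag range mirrors the Granger window used above).

In [ ]:
# Correlation between value_diff and TOTAL_diff shifted by 0-5 weeks.
for lag in range(0, 6):
    corr = adspend_trends_diff['TOTAL_diff'].shift(lag).corr(adspend_trends_diff['value_diff'])
    print('TOTAL_diff shifted by %d week(s): corr = %.3f' % (lag, corr))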

Process data into a modeling dataset

In [168]:
# The most common lag in the list is 5 weeks.
mode(significantlags)
Out[168]:
5
In [169]:
modelingdataset = adspend_trends_diff
In [170]:
modelingdataset.tail()
Out[170]:
CABLE TV_diff MAGAZINES_diff NAT SPOT RADIO_diff NATL NEWSP_diff NETWORK RADIO_diff NETWORK TV_diff NEWSPAPER_diff OUTDOOR_diff SPOT TV_diff SUNDAY MAGS_diff SYNDICATION_diff TOTAL_diff value_diff
FIXED TIME
2018-09-03 -220.6 -645.5 49.3 0.0 -71.7 1342.4 -39.3 -2511.7 -114.9 0.0 0.0 -2212.0 757.374014
2018-09-10 -18.3 782.7 0.0 727.9 0.0 -28.2 -28.5 0.0 -51.9 0.0 0.0 1383.7 -1960.312593
2018-09-17 187.1 -259.4 0.0 -94.7 0.0 -1051.6 -7.8 0.0 14.0 0.0 7.3 -1205.2 1205.581802
2018-09-24 -180.5 -98.6 0.0 -595.5 0.0 943.4 2.7 0.0 162.2 0.0 56.7 290.4 -1161.818662
2018-10-01 59.1 470.5 -50.7 416.2 0.0 -1078.4 95.6 2763.9 -11.3 0.0 -64.0 2600.7 -1087.065425
In [171]:
# create a 5-week lagged version of the data
# first, append 5 more weekly rows to the end of the dataframe
date = pd.to_datetime('2018-10-08')
modelingdataset.loc[date] = np.nan
In [172]:
date = pd.to_datetime('2018-10-15')
modelingdataset.loc[date] = np.nan
In [173]:
date = pd.to_datetime('2018-10-22')
modelingdataset.loc[date] = np.nan
In [174]:
date = pd.to_datetime('2018-10-29')
modelingdataset.loc[date] = np.nan
In [175]:
date = pd.to_datetime('2018-11-05')
modelingdataset.loc[date] = np.nan
In [176]:
modelingdataset.tail()
Out[176]:
CABLE TV_diff MAGAZINES_diff NAT SPOT RADIO_diff NATL NEWSP_diff NETWORK RADIO_diff NETWORK TV_diff NEWSPAPER_diff OUTDOOR_diff SPOT TV_diff SUNDAY MAGS_diff SYNDICATION_diff TOTAL_diff value_diff
FIXED TIME
2018-10-08 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2018-10-15 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2018-10-22 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2018-10-29 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2018-11-05 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
In [177]:
# shift each ad spend column down by 1 to 5 rows to create lagged features.
for acolumn in list(modelingdataset):
    if 'value' not in acolumn:
        # get lag up to 5
        for alag in range(1,6):
            columnname = '%s_lag%s' % (acolumn, alag)
            modelingdataset[columnname] = modelingdataset[acolumn].shift(alag)
In [178]:
# sort the columns by name
modelingdataset.sort_index(axis=1, inplace = True)
modelingdataset.head()
Out[178]:
CABLE TV_diff CABLE TV_diff_lag1 CABLE TV_diff_lag2 CABLE TV_diff_lag3 CABLE TV_diff_lag4 CABLE TV_diff_lag5 MAGAZINES_diff MAGAZINES_diff_lag1 MAGAZINES_diff_lag2 MAGAZINES_diff_lag3 ... SYNDICATION_diff_lag3 SYNDICATION_diff_lag4 SYNDICATION_diff_lag5 TOTAL_diff TOTAL_diff_lag1 TOTAL_diff_lag2 TOTAL_diff_lag3 TOTAL_diff_lag4 TOTAL_diff_lag5 value_diff
FIXED TIME
2013-10-07 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2013-10-14 -33.3 NaN NaN NaN NaN NaN 0.0 NaN NaN NaN ... NaN NaN NaN 1.2 NaN NaN NaN NaN NaN -380.308772
2013-10-21 -7.2 -33.3 NaN NaN NaN NaN 0.0 0.0 NaN NaN ... NaN NaN NaN -112.4 1.2 NaN NaN NaN NaN -618.494848
2013-10-28 -10.4 -7.2 -33.3 NaN NaN NaN 0.0 0.0 0.0 NaN ... NaN NaN NaN 179.8 -112.4 1.2 NaN NaN NaN -747.196074
2013-11-04 9.3 -10.4 -7.2 -33.3 NaN NaN 0.0 0.0 0.0 0.0 ... 150.3 NaN NaN -209.8 179.8 -112.4 1.2 NaN NaN 462.095539

5 rows × 73 columns

In [179]:
modelingdataset.to_csv('google_adspend_differenced_lagged_fordatarobot.csv')

Section 5. Advanced Time Series Analysis Forecasting

In section 5, I

  1. forecasted Google search interest with a neural network.
  2. forecasted Google search interest with LassoLarsCV.
  3. compared the results.

Section 5.1 Neural Network

In [180]:
# load final modeling data
alldata = pd.read_csv('google_adspend_differenced_lagged_fordatarobot.csv')
In [181]:
# drop rows containing empty values: the first 6 rows (NaN from differencing and lagging) and the 5 appended future rows
alldata = alldata.iloc[6:]
alldata = alldata.iloc[:-5]
In [182]:
alldata['FIXED TIME'] = pd.to_datetime((alldata['FIXED TIME']))

Perform neural network modeling based on the GitHub instructions.

In [183]:
# Gradient descent algorithms perform better if the variables are within the range [-1, 1].
# The `value_diff` variable is min-max scaled to bound the transformed variable within [-1, 1].
scaler = MinMaxScaler(feature_range=(-1, 1))
alldata['scaled_value_diff'] = scaler.fit_transform(np.array(alldata['value_diff']).reshape(-1, 1))
In [184]:
# split the data into two parts: a train set and a validation set
# The neural network is trained on the train set.
# This means the loss computation, backpropagation, and weight updates by a gradient descent algorithm are all done on the train set.
split_date = datetime.datetime(year=2017, month=11, day=10, hour=0)
df_train = alldata.loc[alldata['FIXED TIME']<split_date]
df_val = alldata.loc[alldata['FIXED TIME']>=split_date]
print('Shape of train:', df_train.shape)
print('Shape of test:', df_val.shape)
Shape of train: (208, 75)
Shape of test: (47, 75)
In [185]:
# reset index
df_train.reset_index(drop=True, inplace=True)
df_val.reset_index(drop=True, inplace=True)
In [186]:
# plot the train time series of the scaled value_diff
plt.figure(figsize=(15, 6))
g = sns.tsplot(df_train['scaled_value_diff'], color='b')
g.set_title('Time series of Google Trends data differences in train set')
g.set_xlabel('Index')
g.set_ylabel('Google Trends data differences')

# plot the validation time series of the scaled value_diff
plt.figure(figsize=(15, 6))
g = sns.tsplot(df_val['scaled_value_diff'], color='r')
g.set_title('Time series of Google Trends data differences in validation set')
g.set_xlabel('Index')
g.set_ylabel('Google Trends data differences')

plt.show()
In [187]:
def makeXy(ts, nb_timesteps):
    # Build a supervised dataset from the series: each row of X holds the previous
    # nb_timesteps values and the corresponding y is the value that follows them.
    X = []
    y = []
    for i in range(nb_timesteps, ts.shape[0]):
        X.append(list(ts.loc[i-nb_timesteps:i-1]))
        y.append(ts.loc[i])
    X, y = np.array(X), np.array(y)
    return X, y
In [188]:
# generate regressors (X) and target variable (y) for train and validation sets
X_train, y_train = makeXy(df_train['scaled_value_diff'], 7)
print('Shape of train arrays:', X_train.shape, y_train.shape)
Shape of train arrays: (201, 7) (201,)
In [189]:
X_val, y_val = makeXy(df_val['scaled_value_diff'], 7)
print('Shape of validation arrays:', X_val.shape, y_val.shape)
Shape of validation arrays: (40, 7) (40,)
In [190]:
# initialize model; import the Keras pieces used below
from keras.models import Sequential
from keras.layers import Dense, GaussianNoise

model = Sequential()
model.add(Dense(7, activation='relu', kernel_initializer='uniform', input_shape=(7,)))
# add Gaussian noise to the hidden representation to regularize training
model.add(GaussianNoise(0.01))
model.add(Dense(16, kernel_initializer='uniform', activation='relu'))
model.add(Dense(1, kernel_initializer='uniform', activation='sigmoid'))
WARNING:tensorflow:From /anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py:263: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Colocations handled automatically by placer.
In [191]:
# compile the model
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])
In [192]:
model.fit(x=X_train, y=y_train, batch_size=201, epochs=100, verbose=1, validation_data=(X_val, y_val), shuffle=True)
WARNING:tensorflow:From /anaconda3/lib/python3.6/site-packages/tensorflow/python/ops/math_ops.py:3066: to_int32 (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.cast instead.
Train on 201 samples, validate on 40 samples
Epoch 1/100
201/201 [==============================] - 0s 2ms/step - loss: 0.6931 - acc: 0.0000e+00 - val_loss: 0.6924 - val_acc: 0.0000e+00
Epoch 2/100
201/201 [==============================] - 0s 12us/step - loss: 0.6924 - acc: 0.0000e+00 - val_loss: 0.6917 - val_acc: 0.0000e+00
Epoch 3/100
201/201 [==============================] - 0s 13us/step - loss: 0.6918 - acc: 0.0000e+00 - val_loss: 0.6910 - val_acc: 0.0000e+00
Epoch 4/100
201/201 [==============================] - 0s 13us/step - loss: 0.6911 - acc: 0.0000e+00 - val_loss: 0.6902 - val_acc: 0.0000e+00
Epoch 5/100
201/201 [==============================] - 0s 13us/step - loss: 0.6903 - acc: 0.0000e+00 - val_loss: 0.6895 - val_acc: 0.0000e+00
Epoch 6/100
201/201 [==============================] - 0s 17us/step - loss: 0.6896 - acc: 0.0000e+00 - val_loss: 0.6887 - val_acc: 0.0000e+00
Epoch 7/100
201/201 [==============================] - 0s 12us/step - loss: 0.6888 - acc: 0.0000e+00 - val_loss: 0.6879 - val_acc: 0.0000e+00
Epoch 8/100
201/201 [==============================] - 0s 12us/step - loss: 0.6880 - acc: 0.0000e+00 - val_loss: 0.6871 - val_acc: 0.0000e+00
Epoch 9/100
201/201 [==============================] - 0s 14us/step - loss: 0.6873 - acc: 0.0000e+00 - val_loss: 0.6862 - val_acc: 0.0000e+00
Epoch 10/100
201/201 [==============================] - 0s 15us/step - loss: 0.6865 - acc: 0.0000e+00 - val_loss: 0.6854 - val_acc: 0.0000e+00
Epoch 11/100
201/201 [==============================] - 0s 12us/step - loss: 0.6857 - acc: 0.0000e+00 - val_loss: 0.6845 - val_acc: 0.0000e+00
Epoch 12/100
201/201 [==============================] - 0s 15us/step - loss: 0.6848 - acc: 0.0000e+00 - val_loss: 0.6836 - val_acc: 0.0000e+00
Epoch 13/100
201/201 [==============================] - 0s 15us/step - loss: 0.6840 - acc: 0.0000e+00 - val_loss: 0.6827 - val_acc: 0.0000e+00
Epoch 14/100
201/201 [==============================] - 0s 24us/step - loss: 0.6831 - acc: 0.0000e+00 - val_loss: 0.6818 - val_acc: 0.0000e+00
Epoch 15/100
201/201 [==============================] - 0s 24us/step - loss: 0.6822 - acc: 0.0000e+00 - val_loss: 0.6809 - val_acc: 0.0000e+00
Epoch 16/100
201/201 [==============================] - 0s 19us/step - loss: 0.6813 - acc: 0.0000e+00 - val_loss: 0.6799 - val_acc: 0.0000e+00
Epoch 17/100
201/201 [==============================] - 0s 24us/step - loss: 0.6804 - acc: 0.0000e+00 - val_loss: 0.6789 - val_acc: 0.0000e+00
Epoch 18/100
201/201 [==============================] - 0s 19us/step - loss: 0.6794 - acc: 0.0000e+00 - val_loss: 0.6779 - val_acc: 0.0000e+00
Epoch 19/100
201/201 [==============================] - 0s 27us/step - loss: 0.6784 - acc: 0.0000e+00 - val_loss: 0.6769 - val_acc: 0.0000e+00
Epoch 20/100
201/201 [==============================] - 0s 21us/step - loss: 0.6774 - acc: 0.0000e+00 - val_loss: 0.6758 - val_acc: 0.0000e+00
Epoch 21/100
201/201 [==============================] - 0s 17us/step - loss: 0.6764 - acc: 0.0000e+00 - val_loss: 0.6747 - val_acc: 0.0000e+00
Epoch 22/100
201/201 [==============================] - 0s 14us/step - loss: 0.6754 - acc: 0.0000e+00 - val_loss: 0.6736 - val_acc: 0.0000e+00
Epoch 23/100
201/201 [==============================] - 0s 17us/step - loss: 0.6743 - acc: 0.0000e+00 - val_loss: 0.6725 - val_acc: 0.0000e+00
Epoch 24/100
201/201 [==============================] - 0s 13us/step - loss: 0.6732 - acc: 0.0000e+00 - val_loss: 0.6713 - val_acc: 0.0000e+00
Epoch 25/100
201/201 [==============================] - 0s 16us/step - loss: 0.6721 - acc: 0.0000e+00 - val_loss: 0.6701 - val_acc: 0.0000e+00
Epoch 26/100
201/201 [==============================] - 0s 28us/step - loss: 0.6709 - acc: 0.0000e+00 - val_loss: 0.6688 - val_acc: 0.0000e+00
Epoch 27/100
201/201 [==============================] - 0s 42us/step - loss: 0.6697 - acc: 0.0000e+00 - val_loss: 0.6675 - val_acc: 0.0000e+00
Epoch 28/100
201/201 [==============================] - 0s 29us/step - loss: 0.6684 - acc: 0.0000e+00 - val_loss: 0.6662 - val_acc: 0.0000e+00
Epoch 29/100
201/201 [==============================] - 0s 21us/step - loss: 0.6672 - acc: 0.0000e+00 - val_loss: 0.6649 - val_acc: 0.0000e+00
Epoch 30/100
201/201 [==============================] - 0s 16us/step - loss: 0.6659 - acc: 0.0000e+00 - val_loss: 0.6635 - val_acc: 0.0000e+00
Epoch 31/100
201/201 [==============================] - 0s 23us/step - loss: 0.6645 - acc: 0.0000e+00 - val_loss: 0.6621 - val_acc: 0.0000e+00
Epoch 32/100
201/201 [==============================] - 0s 22us/step - loss: 0.6632 - acc: 0.0000e+00 - val_loss: 0.6606 - val_acc: 0.0000e+00
Epoch 33/100
201/201 [==============================] - 0s 13us/step - loss: 0.6617 - acc: 0.0000e+00 - val_loss: 0.6591 - val_acc: 0.0000e+00
Epoch 34/100
201/201 [==============================] - 0s 14us/step - loss: 0.6603 - acc: 0.0000e+00 - val_loss: 0.6575 - val_acc: 0.0000e+00
Epoch 35/100
201/201 [==============================] - 0s 13us/step - loss: 0.6588 - acc: 0.0000e+00 - val_loss: 0.6559 - val_acc: 0.0000e+00
Epoch 36/100
201/201 [==============================] - 0s 14us/step - loss: 0.6572 - acc: 0.0000e+00 - val_loss: 0.6543 - val_acc: 0.0000e+00
Epoch 37/100
201/201 [==============================] - 0s 14us/step - loss: 0.6556 - acc: 0.0000e+00 - val_loss: 0.6526 - val_acc: 0.0000e+00
Epoch 38/100
201/201 [==============================] - 0s 13us/step - loss: 0.6540 - acc: 0.0000e+00 - val_loss: 0.6509 - val_acc: 0.0000e+00
Epoch 39/100
201/201 [==============================] - 0s 13us/step - loss: 0.6524 - acc: 0.0000e+00 - val_loss: 0.6491 - val_acc: 0.0000e+00
Epoch 40/100
201/201 [==============================] - 0s 13us/step - loss: 0.6506 - acc: 0.0000e+00 - val_loss: 0.6473 - val_acc: 0.0000e+00
Epoch 41/100
201/201 [==============================] - 0s 14us/step - loss: 0.6489 - acc: 0.0000e+00 - val_loss: 0.6454 - val_acc: 0.0000e+00
Epoch 42/100
201/201 [==============================] - 0s 13us/step - loss: 0.6471 - acc: 0.0000e+00 - val_loss: 0.6435 - val_acc: 0.0000e+00
Epoch 43/100
201/201 [==============================] - 0s 14us/step - loss: 0.6452 - acc: 0.0000e+00 - val_loss: 0.6415 - val_acc: 0.0000e+00
Epoch 44/100
201/201 [==============================] - 0s 19us/step - loss: 0.6432 - acc: 0.0000e+00 - val_loss: 0.6394 - val_acc: 0.0000e+00
Epoch 45/100
201/201 [==============================] - 0s 24us/step - loss: 0.6413 - acc: 0.0000e+00 - val_loss: 0.6373 - val_acc: 0.0000e+00
Epoch 46/100
201/201 [==============================] - 0s 18us/step - loss: 0.6392 - acc: 0.0000e+00 - val_loss: 0.6351 - val_acc: 0.0000e+00
Epoch 47/100
201/201 [==============================] - 0s 28us/step - loss: 0.6371 - acc: 0.0000e+00 - val_loss: 0.6329 - val_acc: 0.0000e+00
Epoch 48/100
201/201 [==============================] - 0s 16us/step - loss: 0.6350 - acc: 0.0000e+00 - val_loss: 0.6306 - val_acc: 0.0000e+00
Epoch 49/100
201/201 [==============================] - 0s 14us/step - loss: 0.6327 - acc: 0.0000e+00 - val_loss: 0.6283 - val_acc: 0.0000e+00
Epoch 50/100
201/201 [==============================] - 0s 22us/step - loss: 0.6305 - acc: 0.0000e+00 - val_loss: 0.6258 - val_acc: 0.0000e+00
Epoch 51/100
201/201 [==============================] - 0s 20us/step - loss: 0.6282 - acc: 0.0000e+00 - val_loss: 0.6233 - val_acc: 0.0000e+00
Epoch 52/100
201/201 [==============================] - 0s 18us/step - loss: 0.6258 - acc: 0.0000e+00 - val_loss: 0.6208 - val_acc: 0.0000e+00
Epoch 53/100
201/201 [==============================] - 0s 19us/step - loss: 0.6232 - acc: 0.0000e+00 - val_loss: 0.6181 - val_acc: 0.0000e+00
Epoch 54/100
201/201 [==============================] - 0s 15us/step - loss: 0.6207 - acc: 0.0000e+00 - val_loss: 0.6155 - val_acc: 0.0000e+00
Epoch 55/100
201/201 [==============================] - 0s 16us/step - loss: 0.6181 - acc: 0.0000e+00 - val_loss: 0.6127 - val_acc: 0.0000e+00
Epoch 56/100
201/201 [==============================] - 0s 16us/step - loss: 0.6154 - acc: 0.0000e+00 - val_loss: 0.6098 - val_acc: 0.0000e+00
Epoch 57/100
201/201 [==============================] - 0s 15us/step - loss: 0.6127 - acc: 0.0000e+00 - val_loss: 0.6069 - val_acc: 0.0000e+00
Epoch 58/100
201/201 [==============================] - 0s 15us/step - loss: 0.6101 - acc: 0.0000e+00 - val_loss: 0.6040 - val_acc: 0.0000e+00
Epoch 59/100
201/201 [==============================] - 0s 13us/step - loss: 0.6071 - acc: 0.0000e+00 - val_loss: 0.6009 - val_acc: 0.0000e+00
Epoch 60/100
201/201 [==============================] - 0s 12us/step - loss: 0.6041 - acc: 0.0000e+00 - val_loss: 0.5977 - val_acc: 0.0000e+00
Epoch 61/100
201/201 [==============================] - 0s 15us/step - loss: 0.6009 - acc: 0.0000e+00 - val_loss: 0.5945 - val_acc: 0.0000e+00
Epoch 62/100
201/201 [==============================] - 0s 14us/step - loss: 0.5979 - acc: 0.0000e+00 - val_loss: 0.5912 - val_acc: 0.0000e+00
Epoch 63/100
201/201 [==============================] - 0s 16us/step - loss: 0.5945 - acc: 0.0000e+00 - val_loss: 0.5878 - val_acc: 0.0000e+00
Epoch 64/100
201/201 [==============================] - 0s 15us/step - loss: 0.5914 - acc: 0.0000e+00 - val_loss: 0.5843 - val_acc: 0.0000e+00
Epoch 65/100
201/201 [==============================] - 0s 15us/step - loss: 0.5880 - acc: 0.0000e+00 - val_loss: 0.5807 - val_acc: 0.0000e+00
Epoch 66/100
201/201 [==============================] - 0s 13us/step - loss: 0.5847 - acc: 0.0000e+00 - val_loss: 0.5771 - val_acc: 0.0000e+00
Epoch 67/100
201/201 [==============================] - 0s 13us/step - loss: 0.5811 - acc: 0.0000e+00 - val_loss: 0.5734 - val_acc: 0.0000e+00
Epoch 68/100
201/201 [==============================] - 0s 16us/step - loss: 0.5775 - acc: 0.0000e+00 - val_loss: 0.5695 - val_acc: 0.0000e+00
Epoch 69/100
201/201 [==============================] - 0s 16us/step - loss: 0.5739 - acc: 0.0000e+00 - val_loss: 0.5656 - val_acc: 0.0000e+00
Epoch 70/100
201/201 [==============================] - 0s 16us/step - loss: 0.5702 - acc: 0.0000e+00 - val_loss: 0.5616 - val_acc: 0.0000e+00
Epoch 71/100
201/201 [==============================] - 0s 14us/step - loss: 0.5661 - acc: 0.0000e+00 - val_loss: 0.5576 - val_acc: 0.0000e+00
Epoch 72/100
201/201 [==============================] - 0s 14us/step - loss: 0.5623 - acc: 0.0000e+00 - val_loss: 0.5534 - val_acc: 0.0000e+00
Epoch 73/100
201/201 [==============================] - 0s 19us/step - loss: 0.5584 - acc: 0.0000e+00 - val_loss: 0.5491 - val_acc: 0.0000e+00
Epoch 74/100
201/201 [==============================] - 0s 14us/step - loss: 0.5540 - acc: 0.0000e+00 - val_loss: 0.5448 - val_acc: 0.0000e+00
Epoch 75/100
201/201 [==============================] - 0s 14us/step - loss: 0.5499 - acc: 0.0000e+00 - val_loss: 0.5403 - val_acc: 0.0000e+00
Epoch 76/100
201/201 [==============================] - 0s 14us/step - loss: 0.5456 - acc: 0.0000e+00 - val_loss: 0.5358 - val_acc: 0.0000e+00
Epoch 77/100
201/201 [==============================] - 0s 17us/step - loss: 0.5413 - acc: 0.0000e+00 - val_loss: 0.5311 - val_acc: 0.0000e+00
Epoch 78/100
201/201 [==============================] - 0s 20us/step - loss: 0.5368 - acc: 0.0000e+00 - val_loss: 0.5264 - val_acc: 0.0000e+00
Epoch 79/100
201/201 [==============================] - 0s 15us/step - loss: 0.5321 - acc: 0.0000e+00 - val_loss: 0.5216 - val_acc: 0.0000e+00
Epoch 80/100
201/201 [==============================] - 0s 18us/step - loss: 0.5275 - acc: 0.0000e+00 - val_loss: 0.5166 - val_acc: 0.0000e+00
Epoch 81/100
201/201 [==============================] - 0s 17us/step - loss: 0.5231 - acc: 0.0000e+00 - val_loss: 0.5116 - val_acc: 0.0000e+00
Epoch 82/100
201/201 [==============================] - 0s 21us/step - loss: 0.5180 - acc: 0.0000e+00 - val_loss: 0.5065 - val_acc: 0.0000e+00
Epoch 83/100
201/201 [==============================] - 0s 22us/step - loss: 0.5132 - acc: 0.0000e+00 - val_loss: 0.5013 - val_acc: 0.0000e+00
Epoch 84/100
201/201 [==============================] - 0s 20us/step - loss: 0.5082 - acc: 0.0000e+00 - val_loss: 0.4960 - val_acc: 0.0000e+00
Epoch 85/100
201/201 [==============================] - 0s 19us/step - loss: 0.5027 - acc: 0.0000e+00 - val_loss: 0.4906 - val_acc: 0.0000e+00
Epoch 86/100
201/201 [==============================] - 0s 24us/step - loss: 0.4976 - acc: 0.0000e+00 - val_loss: 0.4852 - val_acc: 0.0000e+00
Epoch 87/100
201/201 [==============================] - 0s 16us/step - loss: 0.4925 - acc: 0.0000e+00 - val_loss: 0.4797 - val_acc: 0.0000e+00
Epoch 88/100
201/201 [==============================] - 0s 26us/step - loss: 0.4870 - acc: 0.0000e+00 - val_loss: 0.4740 - val_acc: 0.0000e+00
Epoch 89/100
201/201 [==============================] - 0s 13us/step - loss: 0.4818 - acc: 0.0000e+00 - val_loss: 0.4683 - val_acc: 0.0000e+00
Epoch 90/100
201/201 [==============================] - 0s 17us/step - loss: 0.4763 - acc: 0.0000e+00 - val_loss: 0.4625 - val_acc: 0.0000e+00
Epoch 91/100
201/201 [==============================] - 0s 14us/step - loss: 0.4707 - acc: 0.0000e+00 - val_loss: 0.4566 - val_acc: 0.0000e+00
Epoch 92/100
201/201 [==============================] - 0s 22us/step - loss: 0.4652 - acc: 0.0000e+00 - val_loss: 0.4506 - val_acc: 0.0000e+00
Epoch 93/100
201/201 [==============================] - 0s 34us/step - loss: 0.4596 - acc: 0.0000e+00 - val_loss: 0.4446 - val_acc: 0.0000e+00
Epoch 94/100
201/201 [==============================] - 0s 17us/step - loss: 0.4537 - acc: 0.0000e+00 - val_loss: 0.4384 - val_acc: 0.0000e+00
Epoch 95/100
201/201 [==============================] - 0s 17us/step - loss: 0.4474 - acc: 0.0000e+00 - val_loss: 0.4322 - val_acc: 0.0000e+00
Epoch 96/100
201/201 [==============================] - 0s 18us/step - loss: 0.4418 - acc: 0.0000e+00 - val_loss: 0.4259 - val_acc: 0.0000e+00
Epoch 97/100
201/201 [==============================] - 0s 16us/step - loss: 0.4358 - acc: 0.0000e+00 - val_loss: 0.4196 - val_acc: 0.0000e+00
Epoch 98/100
201/201 [==============================] - 0s 16us/step - loss: 0.4296 - acc: 0.0000e+00 - val_loss: 0.4132 - val_acc: 0.0000e+00
Epoch 99/100
201/201 [==============================] - 0s 12us/step - loss: 0.4233 - acc: 0.0000e+00 - val_loss: 0.4067 - val_acc: 0.0000e+00
Epoch 100/100
201/201 [==============================] - 0s 21us/step - loss: 0.4169 - acc: 0.0000e+00 - val_loss: 0.4001 - val_acc: 0.0000e+00
Out[192]:
<keras.callbacks.History at 0x1c3166fc88>
In [193]:
# predict the results
preds = model.predict(X_val)
In [194]:
# calculate R^2
r2 = r2_score(df_val['scaled_value_diff'].loc[7:], preds)
print('R-squared for the validation set:', round(r2,4))
R-squared for the validation set: -9.7842

Section 5.2 LassoLarsCV


Applying the LassoLarsCV approach from class.

In [195]:
alldata.set_index(alldata['FIXED TIME'], inplace=True)
alldata = alldata.drop(columns=['FIXED TIME'])
In [196]:
listofallpredictors = []
for i in list(alldata):
    if "value" not in i:
        listofallpredictors.append(i)
In [197]:
predictors = alldata[listofallpredictors]
In [198]:
target = alldata['value_diff']
In [199]:
pred_train, pred_test, tar_train, tar_test = train_test_split(predictors, target, test_size=.3, random_state=123)
In [200]:
model = LassoLarsCV(cv=10, precompute=False)
In [201]:
model = model.fit(pred_train.values, tar_train.values)
In [202]:
predictors_model = pd.DataFrame(listofallpredictors)
predictors_model.columns = ['label']
predictors_model['coeff'] = model.coef_
In [203]:
# print the predictors that received a positive coefficient
for index, row in predictors_model.iterrows():
    if row['coeff'] > 0:
        print(row.values)
['CABLE TV_diff' 0.1285734995019148]
['CABLE TV_diff_lag1' 0.15987756252672494]
['NAT SPOT RADIO_diff_lag1' 0.8078840763738762]
['NETWORK RADIO_diff_lag1' 0.06786399492262714]
['NETWORK TV_diff_lag5' 0.1336582158383376]
['NEWSPAPER_diff' 0.053342876501953115]
['SUNDAY MAGS_diff_lag3' 3.8581864776861896]
['SYNDICATION_diff' 0.2710473033753247]
['TOTAL_diff' 0.05915094829297049]
In [204]:
test_error = mean_squared_error(tar_test, model.predict(pred_test))
print('test data MSE')
print(test_error)
test data MSE
2651567.986036804
In [205]:
rsquared_test = model.score(pred_test, tar_test)
print('test data r-square')
print(rsquared_test)
test data r-square
-0.06699879197221481

Assess the degree to which your advanced approach outperforms the methods taught in class.

Neither of my approaches is a good fit for the data: LassoLarsCV has an R^2 of -0.067 and the neural network has an R^2 of -9.784. A negative R^2 means the fitted model performs worse than simply predicting the mean of the target.

  1. I think the neural network is not a good fit because the dataset is too small. Deep learning requires a relatively large dataset, and the model will tend to overfit a small one (a regression-oriented configuration is also sketched below).
  2. I might get a better LassoLarsCV result by setting the lags to 2; the current model has too many predictors relative to the amount of data, which adds noise when fitting the model.
  3. The best approach to solve these issues is probably to improve the training set.
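
One further observation, separate from the data size issue: the network in section 5.1 uses a sigmoid output with binary cross-entropy, which is a classification setup, while the target here is a continuous min-max scaled difference. Below is a minimal regression-oriented sketch, assuming the same Keras API and reusing X_train, y_train, X_val, and y_val from section 5.1; the layer sizes and epoch count are illustrative, and there is no guarantee it performs well on a dataset this small.

In [ ]:
# A regression-oriented sketch (not the model evaluated above): a linear output and
# mean squared error loss match a continuous min-max scaled target.
from keras.models import Sequential
from keras.layers import Dense

reg_model = Sequential()
reg_model.add(Dense(16, activation='relu', kernel_initializer='uniform', input_shape=(7,)))
reg_model.add(Dense(1, kernel_initializer='uniform', activation='linear'))
reg_model.compile(loss='mean_squared_error', optimizer='adam')
reg_model.fit(X_train, y_train, batch_size=16, epochs=100, verbose=0,
              validation_data=(X_val, y_val), shuffle=True)
print('Validation R-squared:', round(r2_score(y_val, reg_model.predict(X_val).ravel()), 4))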
In [ ]: